A relatively new technique in systems biology is single-cell sequencing. Tiny nanoliter droplets are created using microfluidics that contain an individual cell and a bead that has a molecular barcode, used to uniquely label transcripts from that cell.
The volume of each droplet is 0.5 nanoliter. Cells are present at a constant concentration of 100 cells per microliter. What is the rate of cells included per droplet? (Hint: Remember to convert units!!!)
# Calculate the rate of the event (a cell being present in a droplet)
# Convert from ul to nl
cell.droplet.rate = _
print(paste("Rate is", _, "cells per 0.5 nanoliter droplet"))
## Error: <text>:3:21: unexpected input
## 2: # Convert from ul to nl
## 3: cell.droplet.rate = _
## ^
Plot the PMF for the Poisson distribution using this rate (the PMF is a PDF for a discrete distribution):
ggplot to plot the PMF vs. the number of cells.
# Calculate the probability mass function (across a range of 0:10 is sufficient)
cell.droplet.pmf=- _(_)
# Check the sum of the probability mass function
cell.droplet.pmf.sum = _
print(cell.droplet.pmf.sum)
# Create a data.frame with the data
cell.droplet.pmf.df <- data.frame(PMF = _, # the pmf data
N_Cells = _) # the range vector
# Make a density plot with the number of cells (x-axis) and PMF (y-axis).
# NOTE: ggplot evaluates the values of the x and y parameters as variables, so the
# column names below should NOT be enclosed in quotes.
cell.droplet.pmf.plot <- ggplot( _ , # the data.frame
aes( x = _ , # the column to plot on the x-axis
y = _ ) # the column to plot on the y-axis
) +
theme_classic() +
geom_line() +
# Use the `pretty_breaks()` function from the scales package to make the X-axis cleaner
scale_x_continuous(breaks=pretty_breaks())
print(cell.droplet.pmf.plot)
## Error: <text>:3:20: unexpected input
## 2: # Calculate the probability mass function (across a range of 0:10 is sufficient)
## 3: cell.droplet.pmf=- _
## ^
The experimentalist can control the concentration of cells in this experiment. Plot probability mass functions for experiments with 100 cells / microliter, 1000 cells / microliter, and 2000 cells / microliter.
# Calculate the probability mass function (across a range of 0:10 is sufficient)
cell.droplet.pmf.n100 <- _
cell.droplet.pmf.n1000 <- _
cell.droplet.pmf.n2000 <- _
# Create a data.frame with one column for each dataset and one for the range vector
q1c.data.frame <- data.frame(n100 = _ ,
n1000 = _ ,
n2000 = _ ,
N_Cells = _)
# look at the first few lines of the data.frame
_
# Melt the dataframe for ggplot, and specify the column names:
# 'id.vars' : row identifier; independent variable (will plot on x-axis)
# 'value.name' : data value; the dependent variable (will plot on y-axis)
# 'variable.name' : 'variable' to which datapoints belong; here, the datasets
q1c.data.frame.melted <- melt( _ , # data.frame
id.vars = "N_Cells", # x-axis
value.name = "Probability", # y-axis
variable.name = "Concentration") # dataset (plot key)
# look at the first few lines of the melted data frame
_
# Make a density plot, using Concentration as the key to color the three datasets.
# Note that we pass the groupings for the display to ggplot using the 'color' parameter.
q1c.plot <- ggplot( _ , aes( _, _,
color = Concentration )) +
theme_classic() +
geom_line() +
scale_x_continuous(breaks=pretty_breaks())
print(q1c.plot)
## Error: <text>:3:26: unexpected input
## 2: # Calculate the probability mass function (across a range of 0:10 is sufficient)
## 3: cell.droplet.pmf.n100 <- _
## ^
Each experiment produces 1e6 (a million) droplets. Plot the expected number of droplets with exactly one cell for each experimental concentration.
# Select only the data where N_Cells is 1 from the melted data frame
q1d.data.frame.melted <- filter( _ , _ )
# Using the *computed probabilities* and *the total number of droplets per experiment*,
# add a column named 'Expected_Droplets' with the expected number of droplets
# to 'q1d.data.frame.melted'
_ <- _
# Plot the data as a column plot
q1d.plot <- ggplot(_, # the data frame
aes(_)) + # x-axis, y-axis data
theme_classic() +
geom_col()
print(q1d.plot)
## Error: <text>:3:34: unexpected input
## 2: # Select only the data where N_Cells is 1 from the melted data frame
## 3: q1d.data.frame.melted <- filter( _
## ^
\(\Rightarrow\) Which concentration produces the largest number of droplets with one cell? Why might this be?
Your answer here
The optimal cell concentration maximizes the number of droplets with a single cell and minimizes the rate of ‘doublets’, which have 2 or more cells in a single droplet. Plot the number of droplets that have 1 cell and the number that have 2+ cells for each experimental concentration.
# Select the rows where N_Cells is > 1 from the melted data frame you created in Q1c.
q1e.data.frame.melted <- q1c.data.frame.melted[ _ , ]
# Calculate the sum of the probabilities for all N_Cells > 1 for each concentration.
n100.probability <- _(_[_, "Probability"])
n1000.probability <- _
n2000.probability <- _
# Create a new data.frame containing just the probabilities for N_Cells > 1.
# This will be in "melted" format since it is a subset of a melted data.frame.
# The data frame will have 3 rows for the 3 concentrations and their probabilities.
q1e.data <- data.frame(Concentration = _ , # a vector of category labels
Probability = _ , # a vector of probabilities
N_Cells = "2+" ) # this column contains the N_Cells label
# Add a column called 'Expected_droplets' with the *expected number of droplets*
# containing 2+ cells per experiment (Recall that each expermeint has 1 million cells)
q1e.data$_ <- _
# Row bind the data with the data for N Cells == 1
# (add the data in 'qe1.data' as rows to 'q1d.data.frame.melted')
q1e.plot.data <- _
# Plot the data as a column plot
q1e.plot <- ggplot(_, # the data frame you just made
aes(x = _, # x-axis: category labels
y = _, # y-axis: expected droplets
fill = _)) + # fill: group (key) by number of cells
theme_classic() +
geom_col(position="dodge") # plot side-by-side
print(q1e.plot)
## Error: <text>:3:49: unexpected input
## 2: # Select the rows where N_Cells is > 1 from the melted data frame you created in Q1c.
## 3: q1e.data.frame.melted <- q1c.data.frame.melted[ _
## ^
\(\Rightarrow\) Which experimental concentration has the lowest doublet rate? Which concentration would you choose for your experiment?
Your answer here
\(\Rightarrow\) In no more than one paragraph (5 sentences), explain why modeling this problem with the Poisson distribution was an appropriate choice.
Your answer here
Above, we simulated idealized data for this experiment for the purposes of illustration. If we could optically measure the number of cells per droplet in an actual experiment, then we could see whether the data actually fit the expected distribution.
Let’s sample some data from a Poisson distribution with the same parameters and compare it to the ideal data to see if they come from the same distribution. (Of course we expect that they will in this case, but knowing how to do this may come in handy in the future.)
# generate a sample of 1 million cells from a Poisson distribution with the expected
# rate constant for the n100 data
sample.rpois = rpois(_, lambda=_)
head(_)
# make a frequency table from the data
sample.rpois.table = _
sample.rpois.table
# combine the sampled and ideal data into a matrix
# row 1 = frequency table for sample data
# row 2 = probability vector for ideal PMF
# NOTE: the length of the sample table might not always be the same
# and will certainly be shorter than the length of the probability vector you made
# So you need to truncate it just to be sure
chisq.data = _(_,
q1c.data.frame$n100[1:length(sample.rpois.table)])
chisq.data
# do the chi-squared test
chisq.test( _ )
# you could also do the test by comparing two frequency vectors
# make a new matrix to compare obs and exp frequencies
# (multiply the probabilities by 1 million to get comparable numbers, and
# then round off to get integers)
chisq.data2 = _(_,
round(q1c.data.frame$n100[1:length(sample.rpois.table)] * 1e6))
chisq.data2
# now do the chi-squared test with the new matrix
chisq.test( _ )
# make a plot to compare these visually
barplot(chisq.data2, beside=T,
col=c("aquamarine3","coral"))
legend("topright", c("Observed","Expected Poisson"), pch=15,
col=c("aquamarine3","coral"),
bty="n")
## Error: <text>:4:22: unexpected input
## 3: # rate constant for the n100 data
## 4: sample.rpois = rpois(_
## ^
\(\Rightarrow\) How do your sampled data compare to the ideal data in the Chi-squared test? Why would the two p-values not be the same?
ggplot function to print density plotsThis question is like Platform 9 and 3/4 from Harry Potter. It’s not a real question, but you’ll need it to get where you are going. ;-)
Notice that in Q1b and Q1c we used almost the same ggplot function to generate line plots. We are going to be doing a lot of the same below, so let’s make our life a little easier by wrapping this in our own function, so we don’t need to keep repeating the same code again and again. This will make our code look a little cleaner, and probably save us from spending extra time fixing up typos.
# ----------------------------------------------------------------------------------- #
# Function 'print.density.plot': create and print a density plot
#
# Parameters:
# - df: a data frame object
# - x: a string, the name of the df column to plot on the x-axis
# - y: a string, the name of the df column to plot on the y-axis (default="PMF")
# - group.by: an optional 'color' parameter for plotting >1 dataset
#
# Use the `pretty_breaks()` function from the scales package to make the X-axis cleaner
# ----------------------------------------------------------------------------------- #
# NOTE: Below we use the ggplot function 'aes_string', which accepts quoted strings
# instead of evaluating column names as variables directly (as does 'aes' above).
# This is really useful when you do not know in advance what the column names will be!
# ----------------------------------------------------------------------------------- #
print.density.plot <- function(df, x, y="PMF", group.by) {
my.plot = NULL # initialize the plot variable
# make a simple x-y plot
if (missing(group.by)) {
my.plot = ggplot(df, aes_string( x = x, y = y ) ) +
theme_classic() +
geom_line() +
scale_x_continuous(breaks=pretty_breaks())
}
# if we get passed a 'group.by' parameter for the 'color' aesthetic, use it
else {
my.plot = ggplot(df, aes_string( x = x, y = y, color = group.by ) ) +
theme_classic() +
geom_line() +
scale_x_continuous(breaks=pretty_breaks())
}
print(my.plot) # now print it!
}
# ----------------------------------------------------------------------------------- #
Expression of many genes is controlled at the translational level. One example is the stress response gene ATF4. In unstressed conditions, ribosomes bind to an ATF4 transcript at the 5’ end, and then translate an upstream Open Reading Frame (uORF), which is a short sequence that does not produce functional protein.
At the stop codon of this uORF, in unstressed conditions, the ribosome dissociates from the transcript 98% of the time. If the ribosome does not dissociate, it re-initiates translation and produces a functional ATF4 protein.
Recall that the negative binomial PMF describes the probability of \(r\) ‘successes’ after \(x\) ‘failures.’ The random variable for this distribution is \(X_r\), the number of unsuccessful trials before getting a desired number of successes.
\(\Rightarrow\) What is the probability of ‘success’ in this experiment?
Your answer here
Plot the PMF for the number of ribosomes that correctly dissociate from the ATF4 transcript before a ribosome erroneously produces one functional ATF4 protein molecule.
# Create the NB probability mass function over the range 0:200.
atf.unstressed.pmf <- _
# Create a data.frame with the data.
# The first column will contain the pmf data.
# The second column will contain the range vector.
atf4.unstressed.pmf.df <- data.frame(PDF = _,
Correctly_Dissociated = _)
# Make a density plot using our cool function from Q1.5 (the pmf goes on the y-axis).
# Note that we pass column names to our function as strings rather than as variable names.
print.density.plot(df=_, x="_", y="_")
## Error: <text>:3:23: unexpected input
## 2: # Create the NB probability mass function over the range 0:200.
## 3: atf.unstressed.pmf <- _
## ^
\(\Rightarrow\) This particular NB example may also be modeled by another distribution we talked about in class. What is it, and why? Explain.
Your answer here
Each ATF4 transcript must produce 10 functional ATF4 protein molecules to activate the cellular stress response program. Here we will examine the rate of activation of the cellular stress response program.
First, plot the PMF for the NUMBER of ribosome binding events that will occur before 10 functional ATF4 protein molecules are made.
# Create the probability mass function over the range 0:1000 (ribosome bindings)
atf4.response.activation.pmf <- _
# Create a data.frame with the data.
# Name the first column 'PMF' and store the pmf data in it.
# Name the second column 'Ribosome_Bindings' and store the range vector in it.
atf4.response.activation.pmf.df <- data.frame(_ = _,
_ = _)
# Make a density plot using our cool function from Q1.5 (the pmf goes on the y-axis).
# We don't have to specify the parameter names explicitly, as long as we pass them in
# the correct order! (df, x, y)
print.density.plot(_, "_", "_")
## Error: <text>:3:33: unexpected input
## 2: # Create the probability mass function over the range 0:1000 (ribosome bindings)
## 3: atf4.response.activation.pmf <- _
## ^
Ribosomes bind to the transcript at a rate of 1 every 10 seconds. Plot the probability mass function for the number of MINUTES it will take to activate the cellular stress response program.
# Add a Time column (in minutes) to this data frame.
_
# Make a density plot of the TIME to activate the response (the pmf goes on the y-axis).
# The function provides a default value for the function to plot on the y-axis (y="PMF"),
# so you don't even need to pass an argument for it unless you want to change it.
print.density.plot( _ )
## Error: <text>:2:1: unexpected input
## 1: # Add a Time column (in minutes) to this data frame.
## 2: _
## ^
ATF4 is also regulated post-translationally; it is degraded with a half-life of 20 minutes.
\(\Rightarrow\) Do you expect that the cellular stress response program is likely to activate by chance, due to leakage through the regulatory mechanisms? If the protein was stable, how would your answer change? Answer qualitatively (do not create any additional mathematical models). Answer in no more than one paragraph (5 sentences).
Your answer here
You add 4 micrograms per milliliter of tunicamycin to the culture, causing signalling pathways to phosphorylate eIF2alpha.
The ribosome now dissociates from the uORF 60% of the time and re-initiates to create a functional ATF4 protein 40% of the time. Plot the probability mass function for the TIME it takes to activate the stress response program in the presence of tunicamycin and in unstressed conditions (on the same plot).
# Create the probability mass function over the range 0:1000 (ribosome bindings)
# (It still takes 10 productive ATF protein molecules to activate the stress response)
atf4.stress.response.activation.pmf <- _
# Create a data.frame with the data.
# Ribosome_Bindings and Time will be numerical vectors.
# Condition will be an appropriate category label for this condition.
atf4.stress.response.activation.pmf.df <- data.frame(PMF = _,
Ribosome_Bindings = _,
Time = _,
Condition = _)
# Copy the unstressed data to a new data frame, and
# add a 'Condition' column for the unstressed condition.
atf4.unstress.response.activation.pmf.df <- atf4.response.activation.pmf.df
atf4.unstress.response.activation.pmf.df$_ <- _
# Row bind the unstressed and stressed data to create a full data set.
q2d.combined.data <- _
# Make a combined density plot using 'Time' as the independent variable.
# We will use ggplot's 'color' parameter as an aesthetic mapping for the 'Condition'.
# Our function passes this to 'ggplot' using its 'group.by' parameter.
print.density.plot(_ , _ , _ , group.by = _ )
## Error: <text>:4:40: unexpected input
## 3: # (It still takes 10 productive ATF protein molecules to activate the stress response)
## 4: atf4.stress.response.activation.pmf <- _
## ^
Your answer here